Model selection

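Model selection is the task of choosing a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing data set is considered.^[1] The task can also include the design of experiments, so that the data collected are well suited to the model selection problem. Among candidate models with similar predictive or explanatory power, the simplest model is most likely to be the best choice.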

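Model selection may also refer to the problem of selecting a few representative models from a large set of computational models for the purpose of decision making or optimization under uncertainty.^[2]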

Introduction

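In its most basic forms, model selection is one of the fundamental tasks of scientific inquiry. Determining the principle that explains a series of observations is often linked directly to a mathematical model that predicts those observations. For example, when Galileo performed his inclined plane experiments, he demonstrated that the motion of the balls fitted the parabola predicted by his model.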

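Of the countless mechanisms and processes that could have produced the data, how can one even begin to choose the best model? The mathematical approach commonly taken decides among a set of candidate models, and this set must be chosen by the researcher. Simple models such as polynomials are often used, at least initially. Burnham & Anderson (2002) emphasize throughout their book the importance of choosing models on the basis of sound scientific principles, such as an understanding of the phenomenological processes or mechanisms (e.g., chemical reactions) underlying the data.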

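Once the set of candidate models has been chosen, statistical analysis allows us to select the best of these models. What is meant by "best" is controversial. A good model selection technique will balance goodness of fit with simplicity. More complex models are better able to adapt their shape to fit the data (for example, a fifth-order polynomial can fit six points exactly), but the additional parameters may not represent anything useful. (Perhaps those six points are really just randomly distributed about a straight line.) Goodness of fit is generally determined using a likelihood ratio approach, or an approximation of it, which leads to a chi-squared test. Complexity is generally measured by counting the number of parameters in the model.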
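
The six-point example can be made concrete with a minimal sketch. The data values below are hypothetical, chosen to scatter around the line y = 2x + 1, and the helper names `fit_line` and `lagrange` are illustrative rather than taken from any source. The degree-5 interpolating polynomial reproduces all six points with zero residual, while the two-parameter line leaves a small residual; one step outside the data, the two fits disagree sharply.

```python
# Six hypothetical points scattered around the line y = 2x + 1 (an assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lagrange(xs, ys, x):
    """Value at x of the unique degree-(n-1) polynomial through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


slope, intercept = fit_line(xs, ys)

# Residual sum of squares on the six observed points.
rss_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
rss_quintic = sum((y - lagrange(xs, ys, x)) ** 2 for x, y in zip(xs, ys))

# Predictions one step beyond the observed range, at x = 6.
pred_line = slope * 6.0 + intercept
pred_quintic = lagrange(xs, ys, 6.0)

print(f"RSS line    = {rss_line:.4f}, prediction at x=6: {pred_line:.2f}")
print(f"RSS quintic = {rss_quintic:.4f}, prediction at x=6: {pred_quintic:.2f}")
```

With these values the line predicts about 12.9 at x = 6, close to the underlying trend, whereas the quintic predicts about 3.2: its four extra parameters have fitted the noise rather than anything useful.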

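Model selection techniques can be regarded as estimators of some physical quantity, such as the probability that the model produced the given data. Bias and variance are both important measures of the quality of such an estimator; efficiency is also often considered.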

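A standard example of model selection is curve fitting: given a set of points and other background knowledge (e.g., that the points are the result of i.i.d. samples), we must select a curve that describes the function that generated the points.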
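
As a hedged illustration of curve fitting as model selection, the sketch below compares polynomial candidates of degree 0 to 3 on a small hypothetical data set. The text does not prescribe a criterion; the Akaike information criterion (AIC) is used here as one common choice, in the form n ln(RSS/n) + 2k that applies to least-squares fits with Gaussian errors, up to an additive constant. The function names `polyfit`, `rss`, and `aic` are illustrative, and the fit uses plain normal equations, which is adequate only for low degrees.

```python
import math

# Hypothetical observations scattered around y = 2x + 1 (an assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first)."""
    m = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def rss(xs, ys, coeffs):
    """Residual sum of squares of the polynomial on the data."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )


def aic(xs, ys, degree):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    n = len(xs)
    k = degree + 1  # number of fitted coefficients
    return n * math.log(rss(xs, ys, polyfit(xs, ys, degree)) / n) + 2 * k


scores = {degree: aic(xs, ys, degree) for degree in range(4)}
best_degree = min(scores, key=scores.get)
print(scores, "selected degree:", best_degree)
```

On this data the criterion selects degree 1: the quadratic and cubic terms reduce the residual too little to pay for their extra parameters, while the constant model fits far worse.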

Two directions of model selection

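There are two main objectives in inference and learning from data. One is scientific discovery: understanding the underlying data-generating mechanism and interpreting the nature of the data. The other is predicting future or unseen observations. For the second objective, the data scientist is not necessarily concerned with an accurate probabilistic description of the data. Of course, one may also be interested in both directions.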

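In line with these two objectives, model selection can also take two directions: model selection for inference and model selection for prediction. The first direction is to identify the best model for the data, one that preferably provides a reliable characterization of the sources of uncertainty for scientific interpretation. For this goal, it is important that the selected model is not too sensitive to the sample size. Accordingly, an appropriate notion for evaluating model selection is selection consistency, meaning that the most robust candidate will be consistently selected given sufficiently many data samples.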

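The second direction is to choose a model as machinery that offers excellent predictive performance. In this case, the selected model may simply be the lucky winner among a few close competitors, yet its predictive performance can still be the best possible. If so, the model selection is fine for the second goal (prediction), but using the selected model for insight and interpretation may be severely unreliable and misleading. Moreover, for very complex models selected in this way, even predictions may be unreasonable for data only slightly different from those on which the selection was made.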
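
Selection for prediction can be sketched by scoring each candidate on observations it has not seen. The text names no specific procedure; leave-one-out cross-validation is used below as one common way to estimate predictive error, applied to hypothetical data and three illustrative candidates (a constant mean, a least-squares line, and a polynomial interpolating all training points). All function and variable names are assumptions made for this example.

```python
# Hypothetical observations scattered around y = 2x + 1 (an assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]


def predict_mean(train_x, train_y, x):
    """Constant model: always predicts the training mean."""
    return sum(train_y) / len(train_y)


def predict_line(train_x, train_y, x):
    """Least-squares straight line fitted to the training points."""
    n = len(train_x)
    mean_x, mean_y = sum(train_x) / n, sum(train_y) / n
    sxx = sum((t - mean_x) ** 2 for t in train_x)
    sxy = sum((t - mean_x) * (u - mean_y) for t, u in zip(train_x, train_y))
    slope = sxy / sxx
    return mean_y + slope * (x - mean_x)


def predict_interpolant(train_x, train_y, x):
    """Polynomial passing exactly through every training point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        term = yi
        for j, xj in enumerate(train_x):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def loocv_mse(predict):
    """Mean squared error when each point is predicted from all the others."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        errors.append((ys[i] - predict(train_x, train_y, xs[i])) ** 2)
    return sum(errors) / len(errors)


candidates = {
    "mean": predict_mean,
    "line": predict_line,
    "interpolant": predict_interpolant,
}
scores = {name: loocv_mse(predict) for name, predict in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "selected:", best)
```

On this toy data the line has the lowest held-out error. The interpolant, which fits its training points exactly, predicts the held-out points poorly, which illustrates why a very complex model can fail on data only slightly different from the data used to select it.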

References

Aho, K.; Derryberry, D.; Peterson, T. (2014), “Model selection for ecologists: the worldviews of AIC and BIC”, 《Ecology》 95 (3): 631–636, doi:10.1890/13-1452.1, PMID 24804445
Akaike, H. (1994), 〈Implications of informational point of view on the development of statistical science〉, Bozdogan, H., 《Proceedings of the First US/JAPAN Conference on The Frontiers of Statistical Modeling: An Informational Approach—Volume 3》, Kluwer Academic Publishers, pp. 27–38
Anderson, D.R. (2008), 《Model Based Inference in the Life Sciences》, Springer, ISBN 9780387740751
Ando, T. (2010), 《Bayesian Model Selection and Statistical Modeling》, CRC Press, ISBN 9781439836156
Breiman, L. (2001), “Statistical modeling: the two cultures”, 《Statistical Science》 16: 199–231, doi:10.1214/ss/1009213726
Burnham, K.P.; Anderson, D.R. (2002), 《Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach》 2nd ed., Springer-Verlag, ISBN 0-387-95364-7 [this has over 38000 citations on Google Scholar]
Chamberlin, T.C. (1890), “The method of multiple working hypotheses”, 《Science》 15 (366): 92–6, Bibcode:1890Sci....15R..92., doi:10.1126/science.ns-15.366.92, PMID 17782687 (reprinted 1965, Science 148: 754–759, doi:10.1126/science.148.3671.754)
Claeskens, G. (2016), “Statistical model choice” (PDF), 《Annual Review of Statistics and Its Application》 3 (1): 233–256, Bibcode:2016AnRSA...3..233C, doi:10.1146/annurev-statistics-041715-033413
Claeskens, G.; Hjort, N.L. (2008), 《Model Selection and Model Averaging》, Cambridge University Press, ISBN 9781139471800
Cox, D.R. (2006), 《Principles of Statistical Inference》, Cambridge University Press
Ding, J.; Tarokh, V.; Yang, Y. (2018), “Model Selection Techniques - An Overview”, 《IEEE Signal Processing Magazine》 35 (6): 16–34, arXiv:1810.09583, Bibcode:2018ISPM...35f..16D, doi:10.1109/MSP.2018.2867638, S2CID 53035396
Kashyap, R.L. (1982), “Optimal choice of AR and MA parts in autoregressive moving average models”, 《IEEE Transactions on Pattern Analysis and Machine Intelligence》 (IEEE), PAMI-4 (2): 99–104, doi:10.1109/TPAMI.1982.4767213, PMID 21869012, S2CID 18484243
Konishi, S.; Kitagawa, G. (2008), 《Information Criteria and Statistical Modeling》, Springer, Bibcode:2007icsm.book.....K, ISBN 9780387718866
Lahiri, P. (2001), 《Model Selection》, Institute of Mathematical Statistics
Leeb, H.; Pötscher, B. M. (2009), 〈Model selection〉, Anderson, T. G., 《Handbook of Financial Time Series》, Springer, pp. 889–925, doi:10.1007/978-3-540-71297-8_39, ISBN 978-3-540-71296-1
Lukacs, P. M.; Thompson, W. L.; Kendall, W. L.; Gould, W. R.; Doherty, P. F. Jr.; Burnham, K. P.; Anderson, D. R. (2007), “Concerns regarding a call for pluralism of information theory and hypothesis testing”, 《Journal of Applied Ecology》 44 (2): 456–460, doi:10.1111/j.1365-2664.2006.01267.x, S2CID 83816981
McQuarrie, Allan D. R.; Tsai, Chih-Ling (1998), 《Regression and Time Series Model Selection》, Singapore: World Scientific, ISBN 981-02-3242-X
Massart, P. (2007), 《Concentration Inequalities and Model Selection》, Springer
Massart, P. (2014), 〈A non-asymptotic walk in probability and statistics〉, Lin, Xihong, 《Past, Present, and Future of Statistical Science》, Chapman & Hall, pp. 309–321, ISBN 9781482204988
Navarro, D. J. (2019), “Between the Devil and the Deep Blue Sea: Tensions between scientific judgement and statistical model selection”, 《Computational Brain & Behavior》 2: 28–34, doi:10.1007/s42113-018-0019-z
Resende, Paulo Angelo Alves; Dorea, Chang Chung Yu (2016), “Model identification using the Efficient Determination Criterion”, 《Journal of Multivariate Analysis》 150: 229–244, arXiv:1409.7441, doi:10.1016/j.jmva.2016.06.002, S2CID 5469654
Shmueli, G. (2010), “To explain or to predict?”, 《Statistical Science》 25 (3): 289–310, arXiv:1101.0891, doi:10.1214/10-STS330, MR 2791669, S2CID 15900983
Stoica, P.; Selen, Y. (2004), “Model-order selection: a review of information criterion rules” (PDF), 《IEEE Signal Processing Magazine》 21 (4): 36–47, doi:10.1109/MSP.2004.1311138, S2CID 17338979
Wit, E.; van den Heuvel, E.; Romeijn, J.-W. (2012), “'All models are wrong...': an introduction to model uncertainty” (PDF), 《Statistica Neerlandica》 66 (3): 217–236, doi:10.1111/j.1467-9574.2012.00530.x, S2CID 7793470, archived from the original (PDF) on 2 December 2021, retrieved 22 October 2023
Wit, E.; McCullagh, P. (2001), Viana, M. A. G.; Richards, D. St. P., eds., “The extendibility of statistical models”, 《Algebraic Methods in Statistics and Probability》, pp. 327–340
Wójtowicz, Anna; Bigaj, Tomasz (2016), 〈Justification, confirmation, and the problem of mutually exclusive hypotheses〉, Kuźniar, Adrian; Odrowąż-Sypniewska, Joanna, 《Uncovering Facts and Values》, Brill Publishers, pp. 122–143, doi:10.1163/9789004312654_009, ISBN 9789004312654
Owrang, Arash; Jansson, Magnus (2018), “A Model Selection Criterion for High-Dimensional Linear Regression”, 《IEEE Transactions on Signal Processing》 66 (13): 3436–3446, Bibcode:2018ITSP...66.3436O, doi:10.1109/TSP.2018.2821628, ISSN 1941-0476, S2CID 46931136
Gohain, Prakash B.; Jansson, Magnus (2022), “Scale-Invariant and consistent Bayesian information criterion for order selection in linear regression models”, 《Signal Processing》 196: 108499, doi:10.1016/j.sigpro.2022.108499, ISSN 0165-1684, S2CID 246759677

Footnotes

[1] Hastie, Tibshirani, Friedman (2009). 《The elements of statistical learning》. Springer. p. 195.
[2] Shirangi, Mehrdad G.; Durlofsky, Louis J. (2016). “A general method to select representative models for decision making and optimization under uncertainty”. 《Computers & Geosciences》 96: 109–123. Bibcode:2016CG.....96..109S. doi:10.1016/j.cageo.2016.08.002.
